

A Very Big Fight Over a Very Small Language

The New Yorker

In the Swiss Alps, a plan to tidy up Romansh--spoken by less than one per cent of the country--set off a decades-long quarrel over identity, belonging, and the sound of authenticity. After reformers launched Rumantsch Grischun, a standardized version of Romansh's various dialects, traditionalists denounced it as a "bastard," a "castrated" tongue, an act of "linguistic murder."

Ask him how it all began, and he remembers the ice. It was a bitter morning in January, 1982, when Bernard Cathomas, aged thirty-six, carefully picked his way up a slippery, sloping Zurich street. His destination was No. 33, an ochre house with green shutters--the home of Heinrich Schmid, a linguist at the University of Zurich. Inside, the décor suggested that "professor" was an encompassing identity: old wooden floors, a faded carpet, a living room seemingly untouched since the nineteen-thirties, when Schmid had grown up in the house. Schmid's wife served a Swiss carrot cake that manages bourgeois indulgence with a vegetable alibi. Cathomas had already written from Chur, in the canton of the Grisons, having recently become the general secretary of the Lia Rumantscha, a small association charged with protecting Switzerland's least known national language, Romansh. Spoken by less than one per cent of the Swiss population, the language was itself splintered into five major "idioms," not always mutually intelligible, each with its own spelling conventions. Earlier attempts at unification had collapsed in rivalries. In his letter, Cathomas said that Schmid's authority would be valuable in standardizing the language. Cathomas wrote in German but started and ended in his native Sursilvan, the biggest of the Romansh idioms: " ." Translation: "I thank you very much for your interest and attention to this problem."
Schmid, the man he was counting on, hadn't grown up speaking Romansh; he first learned it in high school, and later worked on the "Dicziunari Rumantsch Grischun," a Romansh dictionary begun in 1904 and still lumbering toward completion.






Boosting Process-Correct CoT Reasoning by Modeling Solvability of Multiple-Choice QA

Schumann, Raphael, Riezler, Stefan

arXiv.org Artificial Intelligence

Reasoning quality in large language models depends not only on producing correct answers but also on generating valid intermediate steps. We study this through multiple-choice question answering (MCQA), which provides a controlled setting with fixed answer options. Our analysis shows that when questions are effectively unsolvable for a model, spurious chains of thought (CoTs) are more likely to appear, leading to false positives. By estimating the solvability of each question, we uncover an intermediate regime where learning is most effective. Building on this insight, we adapt outcome-supervised reward models and reinforcement learning with group-relative advantage to incorporate solvability into their objectives. Across experiments on math and multimodal datasets, these modifications consistently yield higher rates of process-correct reasoning and, in reinforcement learning, improved answer accuracy as well. Our results highlight solvability as a key factor for reducing hallucinations and increasing reliability in CoT reasoning. In many applications of CoT reasoning, the generated thought process is as important as the final answer. While some tasks provide gold-standard reasoning chains that can effectively be used for supervised training (Nye et al., 2021; Dziri et al., 2023; Hochlehnert et al., 2025), most datasets lack such annotations. For these cases, correct reasoning has to be incentivized by rewards on correct final answers (Wen et al., 2025). It is known that CoTs can lead to the correct answer, despite an incorrect explanation. Grattafiori et al. (2024) note that this often occurs for questions where only a small fraction of the generated answers is correct. In this work, we investigate this observation in controlled experiments on multiple datasets. To avoid confounding factors of noisy answer extraction and matching, we focus on multiple-choice question answering. 
This format is popular for evaluating models, and widely used training sets such as NuminaMath (Li et al., 2024) contain a large fraction of multiple-choice questions. The fixed number of answer options also allows us to explicitly model the solvability of a question.
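The idea sketched above, estimating a question's solvability from repeated samples and using it to gate the learning signal, can be illustrated with a minimal sketch. The function names, the smoothing, and the gating threshold below are illustrative assumptions, not the paper's exact formulation:

```python
def estimate_solvability(sampled_answers, gold, n_options=4):
    """Estimate solvability as the model's empirical accuracy over k sampled
    answers, lightly smoothed toward the chance rate 1/n_options so a single
    lucky guess does not look like competence."""
    k = len(sampled_answers)
    correct = sum(a == gold for a in sampled_answers)
    return (correct + 1.0 / n_options) / (k + 1)

def solvability_weight(s, n_options=4, eps=0.05):
    """Gate the outcome reward by estimated solvability: questions scoring
    near chance are treated as unsolvable, since rewarding them would mostly
    reinforce spurious chains of thought that guessed the right letter."""
    chance = 1.0 / n_options
    if s <= chance + eps:
        return 0.0  # effectively unsolvable: withhold the learning signal
    return s        # otherwise scale the reward by estimated solvability
```

In an RL setup, the per-question reward (or group-relative advantage) would be multiplied by `solvability_weight(s)`, concentrating learning on the intermediate-solvability regime the abstract identifies as most effective.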


US and UK sign major nuclear power deal: What does it include?

Al Jazeera

British Prime Minister Keir Starmer and United States President Donald Trump have signed a multibillion-pound deal to expand nuclear power across both nations. Known as the Atlantic Partnership for Advanced Nuclear Energy, the agreement aims to speed up the construction of new reactors and provide reliable, low-carbon energy for high-demand sectors, including energy-intensive artificial intelligence data centres. Britain's largest energy supplier, Centrica, will pair up with the US firm X-energy to develop up to 12 advanced modular reactors in Hartlepool, a port town in northeast England, which could power 1.5 million homes and create up to 2,500 jobs. US nuclear technology company Holtec, France's state-backed energy giant EDF Energy, and United Kingdom real estate and investment firm Tritax will develop advanced data centres powered by small modular reactors (SMRs) in Nottinghamshire, East Midlands, valued at about 11 billion pounds ($15bn).


FineDialFact: A benchmark for Fine-grained Dialogue Fact Verification

Chen, Xiangyan, Li, Yufeng, Gan, Yujian, Zubiaga, Arkaitz, Purver, Matthew

arXiv.org Artificial Intelligence

Large Language Models (LLMs) are known to produce hallucinations - factually incorrect or fabricated information - which poses significant challenges for many Natural Language Processing (NLP) applications, such as dialogue systems. As a result, detecting hallucinations has become a critical area of research. Current approaches to hallucination detection in dialogue systems primarily focus on verifying the factual consistency of generated responses. However, these responses often contain a mix of accurate, inaccurate, and unverifiable facts, making a single factual label overly simplistic and coarse-grained. In this paper, we introduce a benchmark, FineDialFact, for fine-grained dialogue fact verification, which involves verifying atomic facts extracted from dialogue responses. To support this, we construct a dataset based on publicly available dialogue datasets and evaluate it using various baseline methods. Experimental results demonstrate that methods incorporating Chain-of-Thought (CoT) reasoning can enhance performance in dialogue fact verification. Despite this, the best F1-score achieved on HybriDialogue, an open-domain dialogue dataset, is only 0.75, indicating that the benchmark remains a challenging task for future research. Our dataset and code will be made public on GitHub.
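The fine-grained setup described above scores each atomic fact independently rather than giving the whole response one label. A minimal sketch of how such a benchmark might compute a macro F1 over per-fact verdicts; the three label names are illustrative assumptions, not necessarily the paper's exact label set:

```python
# Illustrative label set for per-atomic-fact verification verdicts.
LABELS = ("supported", "refuted", "not_enough_info")

def macro_f1(gold, pred):
    """Macro-averaged F1 over per-fact labels: compute F1 for each label
    separately, then average, so rare labels count as much as common ones."""
    f1s = []
    for lab in LABELS:
        tp = sum(g == p == lab for g, p in zip(gold, pred))
        fp = sum(p == lab and g != lab for g, p in zip(gold, pred))
        fn = sum(g == lab and p != lab for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)
```

A response with three accurate claims and one fabrication would thus be scored on four atomic facts, not collapsed into one coarse "hallucinated" label.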


Air Traffic Controller Task Demand via Graph Neural Networks: An Interpretable Approach to Airspace Complexity

Henderson, Edward, Gould, Dewi, Everson, Richard, De Ath, George, Pepper, Nick

arXiv.org Artificial Intelligence

Real-time assessment of near-term Air Traffic Controller (ATCO) task demand is a critical challenge in an increasingly crowded airspace, as existing complexity metrics often fail to capture nuanced operational drivers beyond simple aircraft counts. This work introduces an interpretable Graph Neural Network (GNN) framework to address this gap. Our attention-based model predicts the number of upcoming clearances, the instructions issued to aircraft by ATCOs, from interactions within static traffic scenarios. Crucially, we derive an interpretable, per-aircraft task demand score by systematically ablating aircraft and measuring the impact on the model's predictions. Our framework significantly outperforms an ATCO-inspired heuristic and is a more reliable estimator of scenario complexity than established baselines. The resulting tool can attribute task demand to specific aircraft, offering a new way to analyse and understand the drivers of complexity for applications in controller training and airspace redesign.
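The attribution scheme described above, ablating each aircraft and measuring the change in the model's prediction, can be sketched independently of any particular GNN library. Here `model` is a stand-in callable (any function from a scenario to a scalar clearance-count prediction); the data layout is an assumption for illustration:

```python
def per_aircraft_demand(model, scenario):
    """Attribute predicted task demand to individual aircraft by ablation:
    remove each aircraft in turn and record the drop in the model's
    prediction. `scenario` is a list of per-aircraft feature dicts, each
    with an "id" key; `model` maps such a list to a scalar."""
    baseline = model(scenario)
    scores = {}
    for i, aircraft in enumerate(scenario):
        ablated = scenario[:i] + scenario[i + 1:]  # scenario without this aircraft
        scores[aircraft["id"]] = baseline - model(ablated)
    return scores
```

With a real attention-based GNN in place of the toy callable, the same loop yields the per-aircraft demand scores used for attribution; note it costs one extra forward pass per aircraft.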


Geospatial Mechanistic Interpretability of Large Language Models

De Sabbata, Stef, Mizzaro, Stefano, Roitero, Kevin

arXiv.org Artificial Intelligence

Large Language Models (LLMs) have demonstrated unprecedented capabilities across various natural language processing tasks. Their ability to process and generate viable text and code has made them ubiquitous in many fields, while their deployment as knowledge bases and "reasoning" tools remains an area of ongoing research. In geography, a growing body of literature has been focusing on evaluating LLMs' geographical knowledge and their ability to perform spatial reasoning. However, very little is known about the internal functioning of these models, especially about how they process geographical information. In this chapter, we establish a novel framework for the study of geospatial mechanistic interpretability - using spatial analysis to reverse engineer how LLMs handle geographical information. Our aim is to advance our understanding of the internal representations that these complex models generate while processing geographical information - what one might call "how LLMs think about geographic information" if such phrasing were not an undue anthropomorphism. We first outline the use of probing in revealing internal structures within LLMs. We then introduce the field of mechanistic interpretability, discussing the superposition hypothesis and the role of sparse autoencoders in disentangling polysemantic internal representations of LLMs into more interpretable, monosemantic features. In our experiments, we use spatial autocorrelation to show how features obtained for placenames display spatial patterns related to their geographic location and can thus be interpreted geospatially, providing insights into how these models process geographical information. We conclude by discussing how our framework can help shape the study and use of foundation models in geography.
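The spatial-autocorrelation step mentioned above is commonly measured with global Moran's I: one feature activation per placename, tested for clustering over the placenames' coordinates. A minimal sketch with inverse-distance weights inside a cutoff bandwidth; that weighting choice is an assumption, and the chapter's exact weighting scheme may differ:

```python
import numpy as np

def morans_i(values, coords, bandwidth):
    """Global Moran's I for one activation value per placename.
    values: (n,) feature activations; coords: (n, 2) planar coordinates.
    Weights are inverse distance for pairs within `bandwidth`, else zero.
    I near +1 means nearby placenames have similar activations."""
    values = np.asarray(values, dtype=float)
    coords = np.asarray(coords, dtype=float)
    n = len(values)
    # Pairwise distances, then inverse-distance weights within the bandwidth
    d = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)
    w = np.where((d > 0) & (d <= bandwidth), 1.0 / np.where(d > 0, d, 1.0), 0.0)
    z = values - values.mean()
    return (n / w.sum()) * (w * np.outer(z, z)).sum() / (z ** 2).sum()
```

Run per sparse-autoencoder feature, this gives a map-interpretable score: features whose activations cluster geographically (high I) are candidates for geospatial interpretation.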